Method and system for determining the importance of individual variables in a statistical model

ABSTRACT

A method and system for determining the importance of each of the variables that contribute to the overall score of a model for predicting the profitability of an insurance policy. For each variable in the model, an importance is calculated based on the calculated slope and deviance of the predictive variable. Since the score is developed using complex mathematical calculations combining large numbers of parameters with predictive variables, it is often difficult to interpret from the mathematical formula why, for example, some policyholders receive low scores while others receive high scores. Such clear communication and interpretation of insurance profitability scores is critical if they are used by the various interested insurance parties including policyholders, agents, underwriters, and regulators.

RELATED APPLICATION DATA

This application is a continuation of U.S. Ser. No. 09/996,065, filed Nov. 28, 2001, the contents of which are hereby incorporated by reference in their entirety.

BACKGROUND OF THE INVENTION

The present invention is directed to a method and system for evaluating the results of a predictive statistical scoring model and more particularly to a system and method that determines the contribution of each of the variables that comprise the predictive scoring model to the overall score generated by the model.

Insurance companies provide coverage for many different types of exposures. These include several major lines of coverage, e.g., property, general liability, automobile, and workers compensation, which include many more types of sub-coverage. There are also many other types of specialty coverages. Each of these types of coverage must be priced, i.e., a premium selected that accurately reflects the risk associated with issuing the coverage or policy. Ideally, an insurance company would price the coverage based on a policyholder's actual future losses. Since a policyholder's future losses can only be estimated, an element of uncertainty or imprecision is introduced in the pricing of a particular type of coverage such that certain policies are priced correctly, while others are under-priced or over-priced.

In the insurance industry, a common approach to pricing a policy is to develop or create complex scoring models or algorithms that generate a value or score that is indicative of the expected future losses associated with a policy. The predictive scoring models are used to price coverage for a new policyholder or an existing policyholder. As is known, multivariate analysis techniques such as linear regression, nonlinear regression, and neural networks are commonly used to model insurance policy profitability. A typical insurance profitability application will contain many predictive variables. A profitability application may be comprised of thirty to sixty different variables contributing to the analysis.

The potential target variables in such models can include frequency (number of claims per premium or exposure), severity (average loss amount per claim), or loss ratio (loss divided by premium). The algorithm or formula will directly predict the target variable in the model. The scoring formula contains a series of parameters that are mathematically combined with the predictive variables for a given policyholder to determine the predicted profitability or final score. Various mathematical functions and operations can be used to produce the final score. For example, linear regression uses addition and subtraction operations, while neural networks involve the use of functions or options that are more complex such as sigmoid or hyperbolic functions and exponential operations.

In creating the predictive model, often the predictive variables that comprise the scoring formula or algorithm are selected from a larger pool of variables for their statistical significance to the likelihood that a particular policyholder will have future losses. Once selected from the larger pool of variables, each of the variables in this subset of variables is assigned a weight in the scoring formula or algorithm based on complex statistical and actuarial transformations. The result is a scoring model that may be used by insurers to determine in a more precise manner the risk associated with a particular policyholder. This risk is represented as a score that is the result of the algorithm or model. Based on this score, an insurer can price the particular coverage or decline coverage, as appropriate.

As noted, the problem of how to adequately price insurance coverage is challenging, often requiring the application of complex and highly technical actuarial transformations. These technical difficulties with pricing coverages are compounded by real world marketplace pressures such as the need to maintain an “ease-of-business-use” process with policyholders and insurers, and the underpricing of coverages by competitors attempting to buy market share. Notwithstanding the recognized value of these pricing models and their simplicity of use, known models provide insurers with little information as to why a particular policyholder received his or her score. Consequently, insurers are unable to advise policyholders with any precision as to the reason a policyholder has been quoted a high premium, a low premium, or why, in some instances, coverage has been denied. This leaves both insurers and policyholders alike with a feeling of frustration and almost helpless reliance on the model that is used to determine pricing.

While predictive scoring models are available in the insurance industry to assist insurers in pricing insurance coverage, there is still a need for a method and system that overcomes the foregoing shortcomings in the prior art. Accordingly, there exists a need for a system and method to interpret the results of any scoring model used in the insurance industry to price coverage. Indeed, the system and method may be used to interpret the results of any complex formula. There is especially a need for a system and a method that allow an insurer to determine and rank the contribution of each of the individual predictive variables to the overall score generated by the scoring model. In this manner, insurers and policyholders alike may know with certainty the factors or variables that most influenced the premium paid or price of an insurance policy.

SUMMARY OF THE INVENTION

It is an object of the present invention to address and overcome the deficiencies of the prior art by providing a system and a method for interpreting the results of a scoring model used to price insurance coverage.

It is another object of the invention to provide a system and a method that determine the significance or contribution of each predictive variable to the score generated by such a scoring model.

It is another object of the invention to provide a system and a method that permit insurers to rank the variables according to their significance or contribution to such overall score.

It is still another object of the present invention to provide a system and method that allow insurers to utilize the rank information to inform potential or existing policyholders of those variables that most influenced or affected the pricing.

Accordingly, in one aspect of the invention a method is provided of evaluating the scoring formula or algorithm to determine the contribution of each of the individual predictive variables to the overall score generated by the scoring model. For example, in the commercial auto industry, sophisticated scoring models are created for predicting the profitability of issuing a particular policy based on variables that have been determined to be predictive of profitability. These predictive variables may include the age of the vehicle owner, total number of drivers, speeding violations and the like. In a scoring algorithm having over a dozen variables, the analysis of their individual contributions to the overall score would be very difficult without the present invention.

In another aspect of the present invention, in a system that employs a statistical model comprised of a scoring formula having a plurality of predictive variables for generating a score that is representative of a risk associated with an insurance policyholder and for pricing a particular coverage based on the score, a method is provided of quantifying the contribution of each of the plurality of predictive variables to the score generated by the model including the steps of populating a database associated with the system with a mean value and standard deviation value for each of the plurality of variables, calculating a slope value for each of the plurality of variables, calculating a deviance value based on the mean value and standard deviation value for each of the plurality of variables, and multiplying the deviance value and slope value for each of the plurality of variables to quantify the contribution of each of the plurality of variables to the score. This quantified contribution may then be used to rank the variables by importance to the overall score.

Additional objects, features and advantages of the invention appear from the following detailed disclosure.

The present invention accordingly comprises the various steps and the relation of one or more of such steps with respect to each of the others, and the product which embodies features of construction, combinations of elements, and arrangement of parts which are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

For a fuller understanding of the invention, reference is made to the following description, taken in connection with the accompanying drawings, in which:

FIG. 1 illustrates a system that may be used to interpret and rank the predictive variables according to an exemplary embodiment of the present invention;

FIG. 2 is a flow diagram depicting the steps carried out in interpreting the contribution of each of the predictive external variables in a scoring model according to an exemplary embodiment of the present invention;

FIG. 3 specifies the description of the variables in an example illustrating the application of the method of the present invention to an exemplary scoring formula;

FIG. 4 specifies assumptions made regarding the variables in the exemplary scoring formula; and

FIG. 5 specifies the values for the variables used in the exemplary scoring formula, the application of the method of the present invention and the results thereof.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention described herein creates an explanatory system and method to quantitatively interpret the contribution or significance of any particular variable to a policyholder's profitability score (hereinafter the “Importance”). The methodology of the present invention considers both a) the overall impact of a variable to the scoring model as well as b) the particular value of each variable in determining its Importance to the final score.

As is known, scoring models are developed and used by the insurance industry (as well as other industries) to set an ideal price for a coverage. Many off-the-shelf statistical programs and applications are known to assist developers in creating the scoring models. Once created, relatively standard or common computer hardware may be used to store and run the scoring model. FIG. 1 illustrates an exemplary system 10 that may be employed to implement a scoring model and calculate the Importance of individual predictive variables according to an exemplary embodiment of the present invention. Referring to FIG. 1, the system includes a database 20 for storing the values for each of the variables in the scoring formula, a processor 30 for calculating the target variable in the scoring algorithm as well as the values associated with the present invention, a monitor 40 and input/output means 50 (i.e., keyboard and mouse). Alternatively, the system 10 may be housed on a stand-alone personal computer having a processor, storage means, monitor and input/output means.

Referring to FIG. 2, the steps of a method according to an exemplary embodiment of the present invention are shown generally as 100. The method assumes a model has been generated utilizing one of many statistical and actuarial techniques briefly discussed herein and known in the art. The model is typically a scoring formula or algorithm comprised of a plurality of weighted variables. The database 20 is populated with values for the variables that define the scoring model. These values in the database are used by the scoring model to generate the profitability score. It should be noted that some of the values might be supplied as a separate input from an external source or database.

Similarly, in step 101, the database 20 or a different database is populated with values for the population mean and standard deviation for each of the predictive variables. These values will be used in calculating the Importance as will be described. Next, in step 102, the slope for each predictive variable in the scoring model is determined. As discussed below, this may be simply done in a scoring mode or require a separate calculation. In step 103, a deviance is calculated. After the deviance is calculated, in step 104, the Importance is calculated for each variable by multiplying the slope by the deviance. The variables are then ranked by Importance in step 105. The higher the value, the more important the variable was toward the overall profitability score.

Steps 102 through 104 are now explained in more detail:

Step 102

The first criterion in determining the most important variables for a particular score is the impact that each variable contributes to the overall scoring formula. Mathematically, such impact is given by the slope of the scoring function with respect to the variable being analyzed. To calculate the slope, the first derivative of the formula with respect to the variable is generated. For a nonlinear profitability formula such as a neural network formula or a nonlinear regression formula, the slope may be different from one data point (i.e., policyholder) to the next. Therefore, the average of the slope across all of the data points is used as the first criterion to measure Importance.

Since the first derivative can be either positive or negative for each data point and since the impact should be treated equally regardless of the sign of the slope, it is necessary to calculate the average of the first derivative and then take the absolute value of the average. In summary, the first criterion in determining the most important variables can be represented as follows:

${{Slope}\mspace{14mu} {of}\mspace{14mu} {Predictive}\mspace{14mu} {Variable}\mspace{14mu} x_{i}} = {{{avg}\left( \frac{\partial{F(X)}}{\partial x_{i}} \right)}}$

(where F(X) is the scoring function, which depends on a number of predictive variables x_(i), i = 1, 2, 3, . . . , n).

This technique is also directly applicable to the linear regression model results. However, in a linear regression model, the slope of a variable is constant (same sign and same value) across all of the data points and therefore the average is simply equal to the value of the slope at any particular point.
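The slope calculation of Step 102 can be sketched in Python as follows. This is only an illustrative sketch: the function names, the finite-difference approximation of the first derivative, the step size `h`, the two example scoring functions and the three data points are all assumptions introduced here and are not taken from the figures.

```python
import math

def average_abs_slope(score, data_points, i, h=1e-6):
    """Average the first derivative of `score` with respect to variable i
    over all data points, then take the absolute value of that average.
    The derivative is approximated numerically (an assumption of this sketch)."""
    total = 0.0
    for x in data_points:
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        total += (score(up) - score(down)) / (2 * h)  # central difference
    return abs(total / len(data_points))

# Hypothetical nonlinear scoring function (a single sigmoid unit).
def nonlinear_score(x):
    return 1.0 / (1.0 + math.exp(-(0.5 * x[0] - 0.2 * x[1])))

# Hypothetical two-variable linear scoring function.
def linear_score(x):
    return 0.376 + 0.011 * x[0] - 0.0106 * x[1]

# Hypothetical data points (one list of variable values per policyholder).
points = [[0.0, 1.0], [2.0, 5.0], [1.0, 12.0]]

slopes_nonlinear = [average_abs_slope(nonlinear_score, points, i) for i in range(2)]
slopes_linear = [average_abs_slope(linear_score, points, i) for i in range(2)]
```

For the linear function the result is simply the magnitude of each coefficient, whichever data points are used. For the nonlinear function the derivative differs from point to point, so the averaging matters.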

Step 103

Although the slope impact of a predictive variable as determined in Step 102 is applied to every data point, it is expected that the Importance of any particular variable will be different from one data point to another. Therefore, the overall Importance of a variable should include a measure of its value for each specific policyholder as well as the overall average value determined in Step 102. For example, if the value of a variable deviates “significantly” from the general population mean for a given policyholder, the conclusion might be that the variable played a significant role in determining why that policy received its particular score. On the other hand, if the value of a particular variable for a chosen policy is close to the overall population mean, it should not be judged to have an influential impact on the score, even if the average value of the variable impact (from Step 102) is large, because its value for that policy is similar to the majority of the population.

Therefore, the second criterion in measuring Importance, Deviance, is a measure of how similar or dissimilar a variable is relative to the population mean. Deviance may be calculated using the following formula:

${{Deviance}\mspace{14mu} {of}\mspace{14mu} x_{i}} = \frac{\left( {x_{i} - \mu_{i}} \right)}{\sigma_{i}}$

where μ_(i) is the mean for x_(i) and σ_(i) is the standard deviation for predicitve variable x_(i).
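A minimal Python sketch of the deviance calculation is given below. The function name, the sample population and the policy value are hypothetical. The use of the population (rather than sample) standard deviation is an assumption based on the reference to population values in step 101.

```python
import statistics

def deviance(value, mean, std_dev):
    """Distance of a policy's value from the population mean,
    expressed in units of the population standard deviation."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev

# Hypothetical population of values for one predictive variable,
# e.g., a count of violations per policy.
population = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0]
mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

typical = deviance(mu, mu, sigma)      # a policy at the mean has zero deviance
unusual = deviance(3, mu, sigma)       # a policy well above the mean
```

A policy whose value equals the population mean has a deviance of zero and therefore an Importance of zero for that variable, however large the slope.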

Step 104

Step 104 defines the Importance of a predictive variable as the product of the slope (Step 102) and the deviance (Step 103) of the variable:

Importance=Slope*Deviance

For each policy that is scored, the Importance of each variable is calculated according to the above methodology. The predictive variables are then sorted for every policy according to their Importance measurement to determine which variables contributed the most to the predicted profitability.
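Steps 103 through 105 for a single policy can be sketched as follows. The function name, the variable names and all numeric values are illustrative placeholders and are not the values of FIG. 4 or FIG. 5. The slopes are assumed to have been obtained already, as in Step 102.

```python
def rank_by_importance(names, slopes, means, std_devs, policy_values):
    """Return (name, importance) pairs sorted from highest to lowest Importance,
    where Importance = slope * deviance and deviance = (x - mean) / std_dev."""
    rows = []
    for name, slope, mu, sigma, x in zip(names, slopes, means, std_devs, policy_values):
        dev = (x - mu) / sigma             # Step 103: deviance
        rows.append((name, slope * dev))   # Step 104: Importance
    rows.sort(key=lambda row: row[1], reverse=True)  # Step 105: ranking
    return rows

# Hypothetical three-variable example.
names = ["major_violations", "vehicle_age", "minor_violations"]
slopes = [0.061, 0.0106, 0.011]
means = [0.11, 6.0, 0.5]
std_devs = [0.4, 4.0, 1.0]
policy = [3, 6, 2]

ranked = rank_by_importance(names, slopes, means, std_devs, policy)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

In this sketch the hypothetical vehicle age equals its population mean, so it ranks last even though its slope is comparable to that of the minor violations.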

Referring to FIGS. 3 through 5, the Importance calculation is applied to an exemplary situation illustrating the usage of the proposed Importance calculation in a typical multivariate auto insurance scoring formula. In the example, the following should be assumed: (i) a personal automobile book of business is being analyzed, and (ii) the book has a large quantity of data, e.g., 40,000 data points, available for the analysis. In this example, a linear regression formula is used for its simplicity. As described in more detail below, the scoring formula is given as follows:

Y = 0.376 + 0.0061 X₁ − 0.0106 X₂ + 0.00593 X₃ − 0.00334 X₄ + 0.011 X₅ + 0.075 X₆ + 0.049 X₇ + 0.027 X₈ + 0.0106 X₉ + 0.061 X₁₀ − 0.00242 X₁₁ − 0.062 X₁₂ + 0.0109 X₁₃ + 0.000403 X₁₄ − 0.00194 X₁₅ − 0.0017 X₁₆ + 0.000704 X₁₇

In the above scoring formula, the target variable, Y, will predict the loss ratio (loss/premium) for a personal automobile policy. A multivariate technique, which can be a traditional linear regression or a more advanced nonlinear technique such as nonlinear regression or neural networks, was used to develop the scoring formula. The formula uses seventeen (17) driver and vehicle characteristics to predict the loss ratio, which are described in FIG. 3.

Any assumptions made for the variables are specified in FIG. 4. For each variable, the information gives a further description of the possible values for each variable based on the total population of the data points used in the model development and stored in database 20. Additionally, FIG. 4 specifies the Mean of the modeling data population and Standard Deviation for each variable.

This example illustrates a “bad” (predicted to be unprofitable) policy having the values for the particular variables specified in FIG. 5. The scoring formula contains a constant term, 0.376, and a parameter for each predictive variable. When the parameter is positive, it indicates that the higher the variable, the higher the Y and hence the worse the predicted profitability. When the parameter is negative, it indicates the opposite. For example, the parameter for vehicle age, X₂, is −0.0106. This suggests that the older the vehicle, the lower the Y and the better the profitability. It also suggests that as the vehicle age increases by 1 year, the Y will decrease by 0.0106. On the other hand, the parameter for the number of minor traffic violations, X₅, is 0.011. This suggests that the more violations, the higher the Y and the worse the profitability. It also suggests that as the number of violations increases by one, the Y will increase by 0.011.
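The effect of a one-unit change in a variable can be checked with a short Python sketch of the scoring formula. The constant term and the seventeen coefficients are those of the formula above. The all-zero baseline policy is a hypothetical input chosen only to isolate the effect of a single variable, and the function and variable names are assumptions.

```python
INTERCEPT = 0.376
# Coefficients for X1 through X17, in order, from the scoring formula.
COEFFICIENTS = [
    0.0061, -0.0106, 0.00593, -0.00334, 0.011, 0.075, 0.049, 0.027, 0.0106,
    0.061, -0.00242, -0.062, 0.0109, 0.000403, -0.00194, -0.0017, 0.000704,
]

def score(x):
    """Predicted loss ratio Y for a list of seventeen variable values."""
    if len(x) != len(COEFFICIENTS):
        raise ValueError("expected one value per predictive variable")
    return INTERCEPT + sum(c * v for c, v in zip(COEFFICIENTS, x))

baseline = [0.0] * 17          # hypothetical policy
older_vehicle = list(baseline)
older_vehicle[1] += 1          # X2: vehicle age one year higher
extra_violation = list(baseline)
extra_violation[4] += 1        # X5: one more minor traffic violation

delta_age = score(older_vehicle) - score(baseline)
delta_violation = score(extra_violation) - score(baseline)
```

Because the formula is linear, `delta_age` and `delta_violation` equal the coefficients of X₂ and X₅ (up to floating-point rounding) for any baseline policy, not only the all-zero one used here.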

Referring to FIG. 5, the solution of the model indicates that the policy has a predicted loss ratio score of 1.19, which is more than twice the population average of 0.54. A close review of the seventeen (17) predictive variables further indicates that it has many bad characteristics. For example, it has a number of accidents and violations (X₅, X₆, X₉). It also has a very high number of safety surcharge points (X₄) as well as a bad financial credit score (X₁₄). Also, the vehicle is very expensive (X₁) and the driver is relatively young (X₁₁).

While the policy is obviously a bad policy, the unanswered question is which of the seventeen (17) variables are the key driving factors for the bad score? Are the ten (10) driver safety points the number one reason, or the three (3) major violations the number one reason for such a bad score? In addition, what are the top 5 most important reasons? In order to address these questions, the Importance of each variable is calculated using the method described above and in FIG. 2. The first step (102) is to calculate the slope of each predictive variable:

${{Slope}\mspace{14mu} {of}\mspace{14mu} {Predictive}\mspace{14mu} {Variable}\mspace{14mu} x_{i}} = {{{avg}\left( \frac{\partial{F(X)}}{\partial x_{i}} \right)}}$

Since the scoring formula used in the example is a linear formula, the slope is the same as the parameter or coefficient preceding each variable in the scoring formula, as illustrated in column 3 of FIG. 5. The next step (103) is to calculate the deviance for each predictive variable:

${{Deviance}\mspace{14mu} {of}\mspace{14mu} x_{i}} = \frac{\left( {x_{i} - \mu_{i}} \right)}{\sigma_{i}}$

where μ_(i) is the mean for x_(i) and σ_(i) is the standard deviation for predicitve variable x_(i).

The value (X_(i)) for each variable for the sample policy is given in the second column, and the population mean and the population standard deviation are given in columns 3 and 4 of FIG. 4. The calculated slope and deviance for each variable are shown in columns 3 and 4, respectively, of FIG. 5. The next step (104) is to calculate the Importance, which is the product of slope and deviance. The calculated importance is given in column 5 of FIG. 5. In a final step (105), from the calculated value of the Importance, the variables can be ranked from highest to lowest value as shown in column 6 of FIG. 5.

The ranking is directly translated into a reasons ranking. From column 6, it can be seen that the most important reason why the sample policy is a “bad” policy is that the policy has three major traffic violations (X₁₀), compared to the average of 0.11 violations for the general population. The second most important reason is that the policy has two no-fault incidences (X₆), while the general population on average has only 0.1 incidences.

When these two variables are compared to the other fifteen (15) variables, it becomes clear that this policy has values for these two variables that are very different from the general population, as indicated by the high value of deviance. In addition, the parameters (the slopes) for these two variables are also very high, indicating that both variables have a significant impact on the predicted loss ratio and profitability of the policy. In the case of these two variables, the high values of both the slope and the deviance cause these two variables to emerge as the top two most Important factors to explain the bad score for the policy.
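As a worked illustration using only the values quoted in the text, the Importance of the two top-ranked variables can be written out as follows. The standard deviations σ₁₀ and σ₆ are those specified in FIG. 4 and are left symbolic here because they are not reproduced in the text.

$\text{Importance of } x_{10} = 0.061 \times \frac{3 - 0.11}{\sigma_{10}} = \frac{0.17629}{\sigma_{10}}$

$\text{Importance of } x_{6} = 0.075 \times \frac{2 - 0.1}{\sigma_{6}} = \frac{0.1425}{\sigma_{6}}$

In both cases a large slope is multiplied by a large distance from the population mean, which is consistent with the two variables being ranked highest.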

With the foregoing method and system, an easy-to-understand explanation of which variables are most significant to the score (i.e., Importance) is made available to non-technical end users. Such clear communication and interpretation of insurance profitability scores is critical if they are used by the various interested insurance parties including policyholders, agents, underwriters, and regulators.

In so far as embodiments of the invention described herein may be implemented, at least in part, using software controlled programmable processing devices, such as a computer system, it will be appreciated that one or more computer programs for configuring such programmable devices or system of devices to implement the foregoing described methods are to be considered an aspect of the present invention. The computer programs may be embodied as source code and undergo compilation for implementation on processing devices or a system of devices, or may be embodied as object code, for example. Those of ordinary skill will readily understand that the term computer in its most general sense encompasses programmable devices such as those referred to above, and data processing apparatus, computer systems and the like.

Preferably, the computer programs are stored on carrier media in machine or device readable form, for example in solid-state memory or magnetic memory such as disk or tape, and processing devices utilize the programs or parts thereof to configure themselves for operation. The computer programs may be supplied from remote sources embodied in communications media, such as electronic signals, radio frequency carrier waves, optical carrier waves and the like. Such carrier media are also contemplated as aspects of the present invention.

It will thus be seen that the objects set forth above, among those made apparent from the preceding description, are efficiently attained and, since certain changes may be made in carrying out the above method and in the system set forth without departing from the spirit and scope of the invention, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described and all statements of the scope of the invention which, as a matter of language, might be said to fall therebetween. 

1. A system for calculating the contribution of each of a plurality of variables in a statistical model including a scoring formula for generating a score comprising a database for storing values associated with at least some of the plurality of variables, means for calculating a slope for any of the plurality of variables, means for calculating a deviance value for any of the plurality of variables and means for calculating the contribution of any of the plurality of variables based on the calculated slope and deviance values.
 2. The system of claim 1 wherein the means for calculating the slope comprises a software module that takes the first derivative of the scoring formula with respect to the variable being analyzed.
 3. The system of claim 1 wherein the plurality of variables describe characteristics of at least one of an existing policyholder and potential policyholder and the scoring formula is used to generate a score reflective of the expected loss/premium ratio for an insurance policy.
 4. The system of claim 3 wherein the premium for the insurance policy is based on the score.
 5. The system of claim 1 further comprising means for ranking the individual variables based on the calculated contribution.
 6. The system of claim 1 wherein the means for calculating a deviance value includes a software module that receives inputs for a mean value and a standard deviation value and the deviance value is calculated using the formula: $\text{Deviance of } x_{i} = \frac{x_{i} - \mu_{i}}{\sigma_{i}}$ where μ_(i) is the mean for x_(i) and σ_(i) is the standard deviation for predictive variable x_(i).
 7. The system of claim 1 wherein the contribution is calculated for any of the plurality of variables by multiplying the slope and deviance values.
 8. In a system that employs a statistical model comprised of a scoring formula having a plurality of predictive variables for generating a score that is representative of a risk associated with an insurance policyholder, a method of evaluating the contribution of each of the plurality of predictive variables to the score generated by the model comprising the steps of populating a database associated with the system with a mean value and standard deviation value for each of the plurality of predictive variables, calculating a slope value for each of the plurality of predictive variables, calculating a deviance value based on the mean value and the standard deviation value for each of the plurality of predictive variables, and multiplying the deviance value and slope value for each of the plurality of predictive variables to determine the contribution of each of the plurality of predictive variables to the score.
 9. The method of claim 8 further comprising the step of defining at least one assumption for the mean value associated with at least one of the plurality of predictive variables.
 10. The method of claim 8 wherein the step of calculating the slope further comprises the step of calculating the first derivative of the scoring formula with respect to the predictive variable of the plurality of predictive variables that is being analyzed.
 11. The method of claim 8 wherein the deviance value is calculated as follows: $\text{Deviance of } x_{i} = \frac{x_{i} - \mu_{i}}{\sigma_{i}}$ where μ_(i) is the mean for x_(i) and σ_(i) is the standard deviation for predictive variable x_(i).
 12. The method of claim 8 further comprising the step of ranking each of the plurality of predictive variables based on the contribution of a predictive variable to the score wherein a predictive variable having a higher calculated contribution value is assumed to have had a greater effect on the score.
 13. A method of evaluating the contribution of each of the plurality of variables in a statistical model comprised of a scoring formula having at least one value associated with each of the plurality of variables comprising the steps of obtaining a mean value and a standard deviation value for each of the plurality of variables, calculating a slope value for each of the plurality of variables, calculating a deviance value based on the mean value and the standard deviation value for each of the plurality of variables, and multiplying the deviance value and slope value for each of the plurality of variables to quantify the contribution of each of the plurality of variables to the score.
 14. The method of claim 13 further comprising the step of populating a storage means with the mean value and standard deviation values for each of the plurality of variables.
 15. The method of claim 13 wherein the statistical model is used to assess the profitability of an insurance policy and each of the plurality of variables is associated with at least one of the policyholder and item to be insured.
 16. The method of claim 15 wherein a score generated by the model determines the price for the insurance policy and the contribution is used to identify which variables had the greatest effect on the price.
 17. In a system that employs a statistical model comprised of a scoring formula having a plurality of predictive variables for generating a score that is representative of a risk associated with an insurance policyholder and for pricing a particular coverage based on the score, a method of quantifying the contribution of each of the plurality of predictive variables to the score generated by the model comprising the steps of populating a database associated with the system with a mean value and a standard deviation value for each of the plurality of predictive variables, calculating a slope value for each of the plurality of predictive variables, calculating a deviance value based on the mean value and the standard deviation value for each of the plurality of predictive variables, and multiplying the deviance value and slope value for each of the plurality of predictive variables to quantify the contribution of each of the plurality of predictive variables to the score.
 18. The method of claim 17 further comprising the step of ranking each of the plurality of variables based on the quantified contribution as calculated for each of the plurality of predictive variables.
 19. The method of claim 17 wherein the step of calculating the slope further comprises the step of calculating the first derivative of the scoring formula with respect to a predictive variable of the plurality of predictive variables that is being analyzed.
 20. The method of claim 17 wherein the deviance value is calculated as follows: $\text{Deviance of } x_{i} = \frac{x_{i} - \mu_{i}}{\sigma_{i}}$ where μ_(i) is the mean for x_(i) and σ_(i) is the standard deviation for predictive variable x_(i).